Empirical Evidence for Hilberg’s Conjecture in Single-Author Texts

Author

  • Łukasz Dębowski
Abstract

Hilberg’s conjecture is a statement that the mutual information between two adjacent blocks of text in natural language scales as n^β, where n is the block length. Previously, this hypothesis has been linked to Herdan’s law on the levels of word frequency and of text semantics. Thus it is worth a direct empirical test. In the present paper, Hilberg’s conjecture is tested for a selection of English prose using the Lempel-Ziv algorithm. An upper bound for the exponent β is found to be 0.949.
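
The kind of test described in the abstract can be illustrated with a rough sketch. It is not the paper's procedure: it assumes that the compressed length produced by Python's zlib (DEFLATE, an LZ77 variant with a 32 KB window) can stand in for the Lempel-Ziv code length, and it estimates the mutual information between two adjacent blocks x and y of length n as C(x) + C(y) - C(xy), where C is the compressed length in bits. The function names compressed_bits, block_mutual_information and fit_exponent are hypothetical, and the log-log fit is a plain least-squares slope, not a rigorous upper bound on β.

```python
import math
import zlib


def compressed_bits(data: bytes) -> int:
    """Length in bits of the zlib encoding of data (assumed proxy for an LZ code length)."""
    return 8 * len(zlib.compress(data, 9))


def block_mutual_information(text: bytes, n: int) -> int:
    """Estimate I(n) for the first two adjacent blocks of n bytes as C(x) + C(y) - C(xy)."""
    if len(text) < 2 * n:
        raise ValueError("text must contain at least 2*n bytes")
    left, right = text[:n], text[n:2 * n]
    return compressed_bits(left) + compressed_bits(right) - compressed_bits(left + right)


def fit_exponent(ns, estimates):
    """Least-squares slope of log I(n) against log n, i.e. a fitted value of beta."""
    # Non-positive estimates have no logarithm, so they are skipped.
    points = [(math.log(n), math.log(i)) for n, i in zip(ns, estimates) if i > 0]
    if len(points) < 2:
        raise ValueError("need at least two positive estimates")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx
```

To try it, read a long single-author text as bytes, call block_mutual_information for a range of block lengths n (for example powers of two), and pass the lengths and estimates to fit_exponent. Because of the 32 KB window, zlib cannot detect repetitions between blocks much longer than that, so the sketch is only meaningful for short blocks and will not reproduce the value 0.949 reported above.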


Similar resources

Hilberg’s Conjecture — a Challenge for Machine Learning

We review three mathematical developments linked with Hilberg’s conjecture—a hypothesis about the power-law growth of entropy of texts in natural language, which sets up a challenge for machine learning. First, considerations concerning maximal repetition indicate that universal codes such as the Lempel-Ziv code may fail to efficiently compress sources that satisfy Hilberg’s conjecture. Second,...


Hilberg’s Conjecture: an Updated FAQ

This note is a brief introduction to theoretical and experimental results concerning Hilberg’s conjecture, a hypothesis about natural language. The aim of the text is to provide a short guide to the literature. 1 What is Hilberg’s conjecture? In the early days of information theory, Shannon (1951) published estimates of conditional entropy for printed English. A few decades later, Hilberg (1990...


A New Universal Code Helps to Distinguish Natural Language from Random Texts

Using a new universal distribution called switch distribution, we reveal a prominent statistical difference between a text in natural language and its unigram version. For the text in natural language, the cross mutual information grows as a power law, whereas for the unigram text, it grows logarithmically. In this way, we corroborate Hilberg’s conjecture and disprove an alternative hypothesis ...


A Preadapted Universal Switch Distribution for Testing Hilberg's Conjecture

Hilberg’s conjecture states that the mutual information between two adjacent long blocks of text in natural language grows like a power of the block length. The exponent in this hypothesis can be upper bounded using the pointwise mutual information computed for a carefully chosen code. The bound is the better, the lower the compression rate is but there is a requirement that the code be univers...



Publication date: 2012